scMNMF: a novel method for single-cell multi-omics clustering based on matrix factorization

Abstract Motivation The technology for analyzing single-cell multi-omics data has advanced rapidly and has provided comprehensive and accurate cellular information by exploring cell heterogeneity in genomics, transcriptomics, epigenomics, metabolomics and proteomics data. However, because of the high-dimensional and sparse characteristics of single-cell multi-omics data, as well as the limitations of various analysis algorithms, the clustering performance is generally poor. Matrix factorization is an unsupervised, dimensionality reduction-based method that can cluster individuals and discover related omics variables from different blocks. Here, we present a novel algorithm that performs joint dimensionality reduction learning and cell clustering analysis on single-cell multi-omics data using non-negative matrix factorization that we named scMNMF. We formulate the objective function of joint learning as a constrained optimization problem and derive the corresponding iterative formulas through alternating iterative algorithms. The major advantage of the scMNMF algorithm remains its capability to explore hidden related features among omics data. Additionally, the feature selection for dimensionality reduction and cell clustering mutually influence each other iteratively, leading to a more effective discovery of cell types. We validated the performance of the scMNMF algorithm using two simulated and five real datasets. The results show that scMNMF outperformed seven other state-of-the-art algorithms in various measurements. Availability and implementation scMNMF code can be found at https://github.com/yushanqiu/scMNMF.


Support materials 1 Optimization of the scMNMF model
Since the scMNMF objective function is non-convex, we use an iterative strategy to solve it, where one variable is optimized by fixing the other variables, and the iterations continue until the algorithm converges to reach the termination standard.The objective function is as follows (1) let Φ 1 , Φ 2 , Φ 3 be the Lagrange multiplier for constraints W ≥ 0, B ≥ 0, F ≥ 0, then Lagrange L of scMNMF objective function is Considering the presence of the L1-norm, here we use the ADMM (Alternating Direction Method of Multipliers) algorithm to optimize this problem.Take into account the Karush-Kuhn-Tucker (KKT) conditions, and then calculating the partial derivatives of W, H k , B, F , respectively, we have (2) where σ, σ 1 > 0 are penalty parameter, T, T k are Lagrange multiplier.Then we can get the update rules and summarize the algorithm as follows

Convergence analysis
We conducted a convergence analysis on the iterative formulas of each variable obtained by the alternating iterative method.For each variable, we discussed its convergence and attempted to find the optimal solution.Here, we will only discuss W, and the same applies to H k , B, andF .Construct a auxiliary function G(W, W t ), let diagonal matrix take the second-order Taylor expansion of F (W ).
Algorithm 1 Optimize our proposed model by alternatingly updating 1: Input: single-cell multi-omics data matrix X k , parameters p, k 1 , α. 2: Initialize: Initialize W, H k , B, F , maximum number of iterations and stop error.3: Iterate the following processes until convergence: The following is a proof that matrix M ij is positive semi-definite. for we have Consequently, it converges in accordance with the iterative process.

Parameter selection
The parameters p, k 1 represent the number of latent variables used for dimension reduction and the number of clusters (i.e., the number of cell types).For the parameter p, we use the instability-based NMF model to calculate the specific value.scMNMF algorithm runs t times with random initial solutions and gets t basis matrices, denoted as M 1 , • • • , Mt. Support two matrices M 1 and M 2 , a t × t matrix G is defined where the element g ij is the cross correlation between the i-th column of the matrix M 1 and the j-th column of matrix M 2 .Then the dissimilarity between M 1 and M 2 is defined as Where g •j denotes the j-th column of matrix G.The instability is the discrepancy of all the basis matrices for p, which is defined as The p corresponding to the minimal γ(p) is selected as the number of rows in G.For the parameters k 1 , considering partial label information is known, then the number of clusters can be directly taken as the value of parameter k 1 .We use cross validation to find the optimal values for µ i , however, all µ i have been set to 0.5 in the experiment to save computational time since the the clustering performance is insensitive to the parameter.The parameters λ k and α are used to balance the strength of the fitting term and the regularization term.Similar to the DRjCC model, we define the values of the two parameters as follows: Then λ k and α are automatically updated according to this formula.

Data pre-processing
Before the experiment, we preprocessed the input single-cell multi-omics data matrix X k with the following steps: 1. Delete genes that are not expressed on all cells.. 2. Filtering out genes that are only present in a small number of cells.

Informative gene selection
The selection of informative genes involves determining how many informative genes to choose and how to extract them.Here are the steps we take to select informative genes: 1. Normalize the matrix H such that the mean of each row is 0. 2. Calculate the eigenvalues and corresponding eigenvectors of the matrix HH T .Arrange them in descending order of eigenvalues, denoted as λ 1 > • • • > λn and y 1 , . . ., yn respectively.3. Select the number k of informative genes: For the selection of k, we choose the value of k such that the cumulative contribution of the largest k eigenvalues is greater than a threshold δ, i.e., k = arg l l i=1 λ i / n i=1 λ i ≥ δ.Here, we set the threshold δ to 0.95.4. Define the weight of the i-th gene as β i = k j=1 λ j y ji , and select the top k genes based on the weights in descending order.We selected 28 informative genes from the 10X_10K dataset using this approach, which refer to Table 1 for the specific gene names.

Fig. 4 .
Fig. 4. Visualization of cell clusters before and after dimension reduction of the other four datasets.

Table 1 .
Maker gene for 10X_10K dataset, where the bolded genes indicated a significant correlation with the survival time of patients.